White Wine Exploratory Data Analysis by Irene Florez

Introduction

This project is a quick exploratory data analysis in R. The aim is to use R to explore the relationships between data features.

We will be using a wine data set available at http://www3.dsi.uminho.pt/pcortez/wine/.

‘White Wine Quality’ is a tidy dataset which contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

We will use univariate, bivariate, and multivariate analyses to explore the relationships between the data features and to identify which features are associated with the quality rating. You can read the final summary and reflection at the end of this document.

setup document

load packages

load the data

data variables

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Dataset dimensions: There are 4,898 observations of 12 variables (the 11 chemical properties plus the quality rating).

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Dataset structure: All of the variables are of type num or int; there are no factor variables.
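
The loading chunk is not shown in this report, so the following is only a minimal sketch of how the checks above could be reproduced. The helper name `load_wine` and the semicolon separator are assumptions. To keep the example runnable on its own, it writes the first three rows from the sample output to a temporary file instead of reading the downloaded data.

```r
# Read a semicolon-separated wine file (separator assumed). read.csv()
# converts spaces in the header to dots, e.g. "fixed acidity" -> fixed.acidity.
load_wine <- function(path) {
  read.csv(path, sep = ";")
}

# Demo input: a header plus the first three rows shown in this report.
# In practice, point load_wine() at the downloaded data file instead.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  paste0('"fixed acidity";"volatile acidity";"citric acid";"residual sugar";',
         '"chlorides";"free sulfur dioxide";"total sulfur dioxide";',
         '"density";"pH";"sulphates";"alcohol";"quality"'),
  "7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6",
  "6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6",
  "8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6"
), demo_path)

wine <- load_wine(demo_path)
dim(wine)    # number of rows and columns
names(wine)  # variable names
str(wine)    # type of each column
```

On the full file, `dim(wine)` should report the 4898 rows and 12 columns listed above.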

data dimensions

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Dataset features: We will review quality, alcohol, sulphates, density, and residual sugar.

Univariate Plots Section

quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality has a roughly normal, bell-shaped distribution. Quality values range from 3 to 9, with a mean of 5.88 and a median of 6.00. The most frequent score is 6 (44.88% of wines), and only a small number of wines scored 9 (0.1%).
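
Percentages like the ones above can be computed from a frequency table. The sketch below uses a small made-up vector (`toy_quality`, not the wine data) and a hypothetical helper `quality_share`; with the real data the input would be the quality column.

```r
# Share of observations at each quality score, in percent.
quality_share <- function(quality) {
  counts <- table(quality)
  round(100 * prop.table(counts), 2)
}

# Toy scores for illustration only (NOT the wine ratings).
toy_quality <- c(5, 6, 6, 6, 7, 5, 6, 8)
shares <- quality_share(toy_quality)
print(shares)
```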

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol values range from 8 to 14.20, and the average alcohol value is 10.51. The largest group by frequency has an alcohol value between 9 and 9.5.

sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphate values range from 0.22 to 1.08, and the average sulphate value is 0.49. The majority of wines have a sulphate content between 0.3 and 0.6.

density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

White wine density falls between 0.987 and 1.039, with a mean of 0.994. Density values are closely grouped for the majority of wines. In the first histogram, the bulk of the distribution sits at the left of the axis, which means the density variable has at least one high outlier.

residual_sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The majority of white wines in the dataset have a low residual sugar level. As with density, the histogram of residual sugar is shifted to the left, which means residual sugar may contain outliers.
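
One common way to check for such outliers is the 1.5 × IQR rule. The report does not say which rule it used, so the rule and the helper name `flag_outliers` are assumptions. The sample vector combines the ten residual sugar values from the `str()` output with the maximum (65.8) from the summary.

```r
# Flag values outside the conventional 1.5 * IQR fences (Tukey's rule).
# The choice of rule is an assumption; the report does not name one.
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Ten residual sugar values from the str() output plus the summary maximum.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5, 65.8)

is_outlier <- flag_outliers(residual_sugar)
residual_sugar[is_outlier]   # values flagged as outliers
residual_sugar[!is_outlier]  # values kept for plotting
```

An alternative with a similar effect is to limit the plot axis to a chosen quantile range instead of dropping rows.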

Bivariate Plots Section

Density and alcohol have the strongest relationship: alcohol has a negative relationship with density.

alcohol & quality

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

Now let’s remove the outliers

Alcohol has a positive relationship with quality (r = 0.44). Generally, higher quality wines have higher alcohol content.
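
A Pearson test like the one printed above comes from `cor.test()`. The sketch below runs it on synthetic stand-in data (generated here, not the wine measurements), so the numbers it prints will differ from the report's output.

```r
# Synthetic stand-in data (NOT the wine measurements): quality is built to
# rise with alcohol so that the test has something to detect.
set.seed(42)
alcohol <- runif(200, min = 8, max = 14)
quality <- round(3 + 0.3 * alcohol + rnorm(200))
toy <- data.frame(alcohol, quality)

# Pearson correlation test; with the real data the two arguments would be
# the alcohol and quality columns of the wine data frame.
res <- with(toy, cor.test(alcohol, quality, method = "pearson"))
res$estimate  # sample correlation coefficient
res$conf.int  # 95 percent confidence interval
res$p.value   # p-value for the null hypothesis of zero correlation
```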

sulphates & quality

Now let’s remove outliers

Generally, sulphate content increases as quality increases, but it decreases once quality reaches a score of about 7.5.

alcohol & sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  sulphates and alcohol
## t = 0.35834, df = 1559, p-value = 0.7201
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04055754  0.05866309
## sample estimates:
##         cor 
## 0.009075112

Alcohol and sulphates have a very weak relationship: the correlation (0.009) is not statistically significant (p = 0.72).

density & residual sugar

Density and residual sugar values are both concentrated at the low end of their ranges. Generally, the higher the density, the higher the residual sugar content.

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
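
The `Ord.factor` output above indicates that quality was converted to an ordered factor before the multivariate plots. A minimal sketch of that conversion follows; the variable name `quality_f` is an assumption, and the input is the first ten quality scores from the `str()` output (all 6).

```r
# First ten quality scores as shown in the str() output (all 6).
quality <- rep(6L, 10)

# Levels 3..9 cover the observed range, giving seven ordered levels.
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)
str(quality_f)
```

Because 6 is the fourth of the seven levels, every value is stored with the integer code 4, as in the output above.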

Multivariate Plots Section

alcohol, sulphates, & quality

alcohol, density, & quality

Findings: Alcohol has a positive relationship with quality (r = 0.44). Generally, higher quality wines have higher alcohol content. Alcohol has a negative relationship with density, and density has a negative relationship with quality: wines with a higher score tend to have lower density.

Process: We can see that there are outliers in the density variable. We will remove them for the final plots portion. We will also add a darker background, for contrast. This more clearly highlights the difference between good and bad wines (because a neutral color is used for the OK wines) and the levels in quality are highlighted by the gradation in color.

Because quality is ordinal, we set ggplot() to use sequential or divergent color encoding (this is optimal because it gives a sense of gradation to the different levels in the data) - and not qualitative color encoding (which is for general discrete variables).

Sequential color encoding suits purely ordinal discrete data, while divergent color encoding suits data that is both ordinal and follows a diverging scale (think “Good, OK, Bad”, which can be viewed as appropriate for this dataset). For the plot above, “RdYlBu” is a specific divergent color scheme, the name option changes the legend title, and direction=-1 reverses the order of the colors.
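
The plotting chunk itself is not shown, so the following is a sketch of a ggplot2 call consistent with the description, not the original code. The data frame `toy` holds made-up rows (one per quality level), and the choice of `theme_dark()` for the darker background is an assumption.

```r
library(ggplot2)

# Illustrative rows only (NOT taken from the dataset), one per quality level.
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 13.0),
  density = c(1.0010, 0.9985, 0.9951, 0.9956, 0.9930, 0.9910, 0.9900),
  quality = factor(3:9, levels = 3:9, ordered = TRUE)
)

p <- ggplot(toy, aes(x = alcohol, y = density, color = quality)) +
  geom_point(size = 3) +
  # Divergent palette: the middle (OK) scores get the neutral color.
  scale_color_brewer(type = "div", palette = "RdYlBu",
                     name = "Quality", direction = -1) +
  theme_dark()  # darker background for contrast (theme choice assumed)

print(p)
```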

Final Plots and Summary

plot one

Plot title: Relationships & Correlations

Alcohol and density have the strongest relationship; the two are negatively related. Density and alcohol are also the features most strongly associated with quality.

The amount of residual sugar has a weak relationship with the quality of wine.

plot two

There is a high concentration of wines with quality scores of 5 to 7, and these largely have alcohol levels between 8.5 and 13.5.

plot three

Plot title: Relationship & distribution of alcohol, density, and quality

Findings: Alcohol has a positive relationship with quality (r = 0.44). Generally, higher quality wines have higher alcohol content. Alcohol has a negative relationship with density, and density has a negative relationship with quality.

Process: We removed the outliers in the density variable and added a darker background, for contrast. This more clearly highlights the difference between good and bad wines (because a neutral color is used for the OK wines) and the levels in quality are highlighted by the gradation in color.

Reflection

EDA results: We reviewed five features in this data set: quality, alcohol, density, sulphates, and residual sugar.
1) Density is the best predictor of quality. Higher quality wines tend to have lower density and higher alcohol content.
2) Alcohol has a positive relationship with quality (r = 0.44). Generally, higher quality wines have higher alcohol content.
3) Sulphates and alcohol content generally do not have a strong relationship.
4) The amount of residual sugar has a weak relationship with the quality of wine. Overall, higher quality wines tend to have low residual sugar content. Generally, as density increases, residual sugar also increases.

Process reflection: Here we used plotting (scatterplots, histograms, boxplots, and line graphs).

In the future, I would like to build a correlation matrix using all of the variables and to review ratios such as quality / alcohol. I would also like to do a deeper dive into alcohol and density. Lastly, I would like to create a model that can predict wine quality from these observations with some degree of confidence. Overall, I would like more practice with transparency, jitter, smoothing, and limiting axes.
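
As a starting point for the correlation matrix mentioned above, here is a minimal sketch using `cor()`. It uses only the ten sample rows visible in the `str()` output, for four of the variables, so the coefficients it prints say nothing reliable about the full dataset. Quality is left out because it is constant (6) in these rows.

```r
# Ten sample rows copied from the str() output earlier in this report.
sample_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  sulphates      = c(0.45, 0.49, 0.44, 0.4, 0.4, 0.44, 0.47, 0.45, 0.49, 0.45),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(sample_rows), 2)
print(cor_matrix)
```

On the full data frame, the same call would cover all 12 variables at once.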